Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates
We establish optimal convergence rates for a decomposition-based scalable
approach to kernel ridge regression. The method is simple to describe: it
randomly partitions a dataset of size N into m subsets of equal size, computes
an independent kernel ridge regression estimator for each subset, then averages
the local solutions into a global predictor. This partitioning leads to a
substantial reduction in computation time versus the standard approach of
performing kernel ridge regression on all N samples. Our two main theorems
establish that despite the computational speed-up, statistical optimality is
retained: as long as m is not too large, the partition-based estimator achieves
the statistical minimax rate over all estimators using the set of N samples. As
concrete examples, our theory guarantees that the number of processors m may
grow nearly linearly for finite-rank kernels and Gaussian kernels and
polynomially in N for Sobolev spaces, which in turn allows for substantial
reductions in computational cost. We conclude with experiments on both
simulated data and a music-prediction task that complement our theoretical
results, exhibiting the computational and statistical benefits of our approach.
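The partition, fit, and average steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the `bandwidth` and `lam` defaults, and the helper names `gaussian_kernel`, `fit_local_krr`, and `divide_and_conquer_krr` are all hypothetical choices made for the example.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def fit_local_krr(X, y, lam, bandwidth):
    """Kernel ridge regression on one subset: solve (K + lam * n * I) alpha = y."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, bandwidth)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return X, alpha

def divide_and_conquer_krr(X, y, m, lam=1e-3, bandwidth=1.0, seed=0):
    """Randomly split the N samples into m (near-)equal subsets, fit an
    independent KRR estimator on each, and return the averaged predictor."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    parts = np.array_split(perm, m)
    models = [fit_local_krr(X[idx], y[idx], lam, bandwidth) for idx in parts]

    def predict(X_new):
        # Average the m local predictions into one global prediction.
        preds = [gaussian_kernel(X_new, Xi, bandwidth) @ ai for Xi, ai in models]
        return np.mean(preds, axis=0)

    return predict
```

Each local fit solves an (N/m) x (N/m) linear system instead of one N x N system, which is where the computational saving comes from. With `m=1` the sketch reduces to ordinary kernel ridge regression on all N samples.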
Randomized Smoothing for Stochastic Optimization
We analyze convergence rates of stochastic optimization procedures for
non-smooth convex optimization problems. By combining randomized smoothing
techniques with accelerated gradient methods, we obtain convergence rates of
stochastic optimization procedures, both in expectation and with high
probability, that have optimal dependence on the variance of the gradient
estimates. To the best of our knowledge, these are the first variance-based
rates for non-smooth optimization. We give several applications of our results
to statistical estimation problems, and provide experimental results that
demonstrate the effectiveness of the proposed algorithms. We also describe how
a combination of our algorithm with recent work on decentralized optimization
yields a distributed stochastic optimization algorithm that is order-optimal.
Comment: 39 pages, 3 figures
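The smoothing step alone can be illustrated with a short sketch. It assumes Gaussian perturbations and uses the l1 norm as the non-smooth objective. The function name `smoothed_subgradient` and its parameters are hypothetical, and the accelerated gradient scheme analyzed in the paper is not reproduced here.

```python
import numpy as np

def smoothed_subgradient(subgrad, x, u, num_samples, rng):
    """Monte Carlo estimate of the gradient of the smoothed function
    f_u(x) = E[f(x + u * Z)], Z ~ N(0, I), obtained by averaging
    subgradients of f at randomly perturbed points."""
    Z = rng.standard_normal((num_samples, x.size))
    grads = np.array([subgrad(x + u * z) for z in Z])
    return grads.mean(axis=0)

# Example objective: f(x) = ||x||_1, with np.sign as a subgradient.
rng = np.random.default_rng(0)
g_far = smoothed_subgradient(np.sign, np.array([5.0, -5.0]), 0.1, 100, rng)
g_kink = smoothed_subgradient(np.sign, np.zeros(2), 0.1, 20000, rng)
```

Far from the kink at zero the estimate coincides with the ordinary subgradient, while at the kink the perturbations average the two one-sided slopes toward zero, which is the sense in which the smoothed objective is differentiable.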
Differentially Private Model Selection with Penalized and Constrained Likelihood
In statistical disclosure control, the goal of data analysis is twofold: The
released information must provide accurate and useful statistics about the
underlying population of interest, while minimizing the potential for an
individual record to be identified. In recent years, the notion of differential
privacy has received much attention in theoretical computer science, machine
learning, and statistics. It provides a rigorous and strong notion of
protection for individuals' sensitive information. A fundamental question is
how to incorporate differential privacy into traditional statistical inference
procedures. In this paper we study model selection in multivariate linear
regression under the constraint of differential privacy. We show that model
selection procedures based on penalized least squares or likelihood can be made
differentially private by a combination of regularization and randomization,
and propose two algorithms to do so. We show that our private procedures are
consistent under essentially the same conditions as the corresponding
non-private procedures. We also find that under differential privacy, the
procedure becomes more sensitive to the tuning parameters. We illustrate and
evaluate our method using simulation studies and two real data examples.
Sharing Social Network Data: Differentially Private Estimation of Exponential-Family Random Graph Models
Motivated by a real-life problem of sharing social network data that contain
sensitive personal information, we propose a novel approach to release and
analyze synthetic graphs in order to protect privacy of individual
relationships captured by the social network while maintaining the validity of
statistical results. A case study using a version of the Enron e-mail corpus
dataset demonstrates the application and usefulness of the proposed techniques
in solving the challenging problem of maintaining privacy and supporting
open access to network data to ensure reproducibility of existing studies and
discovering new scientific insights that can be obtained by analyzing such
data. We use a simple yet effective randomized response mechanism to generate
synthetic networks under ε-edge differential privacy, and then use
likelihood based inference for missing data and Markov chain Monte Carlo
techniques to fit exponential-family random graph models to the generated
synthetic networks.
Comment: Updated, 39 pages
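A randomized response mechanism of the kind mentioned in this abstract can be sketched as below. The sketch assumes an undirected simple graph stored as a 0/1 adjacency matrix and a single flip probability 1 / (1 + e^ε) shared by every dyad, which is the textbook randomized response calibration. The function name `randomized_response_graph` is hypothetical, and the paper's exact mechanism may differ.

```python
import numpy as np

def randomized_response_graph(adj, epsilon, seed=None):
    """Release a synthetic undirected graph by independently flipping each
    dyad (edge or non-edge) with probability 1 / (1 + exp(epsilon))."""
    rng = np.random.default_rng(seed)
    adj = np.asarray(adj, dtype=int)
    n = adj.shape[0]
    flip_prob = 1.0 / (1.0 + np.exp(epsilon))
    # Work on the strict upper triangle so each dyad is perturbed once.
    iu = np.triu_indices(n, k=1)
    flips = rng.random(iu[0].size) < flip_prob
    noisy_upper = np.where(flips, 1 - adj[iu], adj[iu])
    out = np.zeros((n, n), dtype=int)
    out[iu] = noisy_upper
    # Symmetrize; the diagonal stays zero (no self-loops).
    return out + out.T
```

Smaller values of `epsilon` push the flip probability toward 1/2 and give stronger privacy at the cost of a noisier synthetic graph. Inference then has to account for the known flipping process, which the abstract handles through likelihood-based inference for missing data.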
MRI-based Surgical Planning for Lumbar Spinal Stenosis
The most common reason for spinal surgery in elderly patients is lumbar
spinal stenosis (LSS). For LSS, treatment decisions based on clinical and
radiological information as well as the personal experience of the surgeon show
large variance. Thus a standardized support system is of high value for a more
objective and reproducible decision. In this work, we develop an automated
algorithm to localize the stenosis causing the symptoms of the patient in
magnetic resonance imaging (MRI). With 22 MRI features of each of five spinal
levels of 321 patients, we show it is possible to predict the location of the
lesion triggering the symptoms. To support this hypothesis, we conduct an
automated analysis of labeled and unlabeled MRI scans extracted from 788
patients. We confirm quantitatively the importance of radiological information
and provide an algorithmic pipeline for working with raw MRI scans.